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ABSTRACT 

Motivation: A common question in genomic analysis is whether two 
sets of genomic intervals overlap significantly. This question arises, for 
example, when interpreting ChlP-Seq or RNA-Seq data in functional 
terms. Because genome organization is complex, answering this 
question is non-trivial. 

Summary: We present Genomic Association Test (GAT), a tool for 
estimating the significance of overlap between multiple sets of gen- 
omic intervals. GAT implements a null model that the two sets of 
intervals are placed independently of one another, but allows each 
set's density to depend on external variables, for example, isochore 
structure or chromosome identity. GAT estimates statistical signifi- 
cance based on simulation and controls for multiple tests using the 
false discovery rate. 

Availability: GAT's source code, documentation and tutorials are 
available at http://code.google.eom/p/genomic-association-tester. 
Contact: andreas.heger@dpag.ox.ac.uk 
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1 INTRODUCTION 

A common question in genomic analysis is whether two sets of 
genomic intervals, for example, ChlP-seq peaks and gene anno- 
tation classes, overlap significantly more than expected by 
chance alone. Interval overlap is easy to compute, but the sig- 
nificance can be computed analytically only for trivial situations. 
Hence, significance is usually estimated by simulation under 
some null model. This model must account for genome organ- 
ization; a model that assumes independent and uniform place- 
ment of both interval sets is almost always inappropriate when 
testing for association with gene annotations because gene dens- 
ity strongly correlates with G + C content, and datasets of inter- 
est often also show G + C biases. 

Here, we introduce Genomic Association Test (GAT), a tool 
for computing the significance of overlap between multiple 
sets of genomic intervals. GAT permits the restriction of the 
analysis to parts of a genome relevant to the experiment and 
accounts for chromosomal and isochore biases. Additional gen- 
omic features can be controlled for by providing additional seg- 
mentation files. 
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GAT's approach was developed originally to test for the as- 
sociation of non-coding transcripts with other genomic elements 
(Ponjavic et al, 2007), but has since been applied to a variety of 
problems, including: 

• Conservation of non-coding transcription between human 
and mouse (Church et al, 2009); 

• Enrichment of histone marks and evolutionarily conserved 
genomic regions within non-coding transcripts (Marques 
and Ponting, 2009); 

• Functional prediction of non-coding transcripts via their 
neighboring genes (Marques and Ponting, 2009); and 

• Enrichment of ChlP-Seq binding events within signatures of 
open chromatin or disease-associated intervals (Ramago- 
palan et al, 2010). 

GAT's re-implementation delivers to the scientific community 
the extended functionality of the Ponjavic et al. (2007) methods. 

2 USAGE 

GAT is controlled from the command line. It requires at least 
three bed-formatted files that delimit genomic intervals (tuples of 
chromosome, start and end). The principal output of GAT is a 
table listing significant overlaps. 

2.1 Input 

Example: does a set of transcription factor binding site intervals 
from a ChlP-Seq experiment overlaps more than expected by 
chance with a set of DNasel-hypersensitive sites? To perform 
this analysis, GAT requires three files: 

(1) A bed-formatted file with the intervals from the ChlP-Seq 
experiment (Segments S). Several experiments can be sup- 
plied as multiple files or as a single file with multiple 
tracks. 

(2) A bed-formatted file with DNasel-hypersensitive sites 
{Annotations A). These could be obtained directly from 
the UCSC Genome Browser (Rosenbloom et al., 2012). 
Several annotations from, for example, multiple cell lines 
can be supplied as multiple files or as a single file with 
multiple tracks. 

(3) A bed-formatted file with the workspace (W). The work- 
space defines the sequence that is accessible for the 
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simulation. The simplest workspace contains the full 
genome assembly. In this example, the analysis should be 
restricted to only repeat-free regions, as only these are re- 
liably mappable by short read data and thus could contain 
ChlP-Seq intervals. Again, appropriate bed-formatted files 
are available from the UCSC Genome Browser. 

By default, the randomization procedure accounts for differ- 
ences among chromosomes; for example, the X chromosome 
contains many sequence features that are atypical of autosomes. 
In addition to chromosome identity, local genomic G+C con- 
tent is another common confounding factor. For example, G + C 
content might cause experimental biases in sequencing and 
hybridization protocols, while it is also a correlate of gene density 
(Lander et ah, 2001). To correct for G + C content, an optional 
bed-formatted file with the isochore structure of the genome can 
be supplied. GAT will then normalize by isochore and by 
chromosome. Here, isochores are discretized, for example, the 
genome is partitioned into windows falling into eight bins of 
different regional G + C content. 

2.2 Output 

In the aforementioned example, GAT will compute the overlap 
of ChlP-Seq binding events and DNasel-hypersensitive sites. 
GAT will also estimate if the overlap is larger or smaller than 
expected by chance and will provide an empirical P-value of the 
statistical significance. If multiple ChlP-Seq experiments or mul- 
tiple annotations have been submitted, GAT will compute the 
overlap for each combination of experiment and annotation and 
will estimate its significance. Storey's g-value (Storey and 
Tibshirani, 2003) or the Benjamini-Hochberg method 
(Benjamini and Hochberg, 1995) is used to control for multiple 
testing using a False discovery rate (FDR) procedure. 

3 IMPLEMENTATION 

3.1 Overview 

GAT is a python script (http://python.org) requiring only 
common and freely available numerical and scientific libraries. 
The memory and time-critical parts are implemented in cython 
(http://cython.org). It requires two collections of genomic inter- 
vals: Segments (S) and Annotations (A). Each collection can con- 
tain one or more lists of genomic intervals (Si, S%, . . ., S m ; A u 
A 2 , . . . ,A n ). Intervals within a list of genomic intervals are 
required to be non-overlapping, and any overlapping intervals 
within S or A are merged prior to analysis. In addition, GAT 
requires a Workspace W describing the part of the genome ac- 
cessible to the simulation. The analysis proceeds as follows. For 
each pair of interval lists S x and A y (xe {1 , . . . ,m}, 
y e {1 , . . . , »}), GAT computes the overlap between the intervals 
in S v and A y within workspace W: observed = \ S x n A y n W\ . 
| Here, | is the overlap operator and defaults to the number of 
nucleotides overlapping, but other operators (such as the 
number of segments) can be used. GAT subsequently creates 
randomly placed intervals in the genome with the same size dis- 
tribution of S x within the workspace W. See below for simulation 
details. The overlap between each simulated set and A y is re- 
corded. The average over all simulations represents the expected 



overlap. GAT reports the fold enrichment as the ratio of observed 
and expected overlap and associates an empirical P-value with it. 
GATs runtime and memory usage scale linearly with the 
number of simulations and the number and size of the genomic 
interval sets S, A and W. 

3.2 Sampling method 

The sampling method creates a list R of randomly placed inter- 
vals from an interval list S x within a workspace W. The sampling 
is done on a per-chromosome basis. For each chromosome c, 
randomly placed intervals are created by a two-step procedure: 

(1) Select an interval size from the empirical interval size dis- 
tribution S xc . 

(2) Select a position within the workspace W c . 

Sampled intervals are added to R c until exactly the same 
number of nucleotides are in R c as are in S XjC . For reasons of 
performance, intervals are initially sampled without checking for 
overlap. Overlaps and overshoot are subsequently resolved in an 
iterative procedure once the sampled number of nucleotides ap- 
proximates the target number. 

The current sampling protocol is restricted to non-overlapping 
single segment intervals. Although amenable to many genomic 
features, it notably leaves discontinuous genomic segments, such 
as transcripts, untreated. 

3.3 Isochores 

Isochores are defined within GAT as chromosomal segments 
within a workspace. For each isochore i, the workspace W is 
subdivided into a workspace Wi=WC\Ii. The sampling is 
performed separately for each W : and samples combined at the 
end. Isochores are thus treated in an equivalent manner to 
chromosomes. Isochores can be defined by G + C content, but 
can reflect any segmentation of the genome, such as chromatin 
marks. 

4 CONCLUSIONS 

GAT provides critical functionality for genomic analyses. By 
using standard BED files, it may be used alongside major data 
resources, such as the UCSC Genome Browser and Galaxy 
(Giardine et ai, 2005). GAT can be used in a similar context 
to GREAT (McLean et ai, 2010) and other tools, but can ad- 
dress a more diverse range of questions because of its simulation 
approach that takes into account both segment and annotation 
size distributions. 
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